Differential privacy is a modern approach in privacy-preserving data analysis to control the amount
of information that can be inferred about an individual by querying a database. The most common
techniques are based on the introduction of probabilistic noise, often defined as a Laplacian distribution
parametric on the sensitivity of the query. In order to maximize the utility of the query, it is crucial to
estimate the sensitivity as precisely as possible.

In this paper we consider relational algebra, the classical language for queries in relational databases, 
and we propose a method for computing a bound on the sensitivity of queries in an intuitive and 
compositional way. We use constraint-based techniques to accumulate the information on the possi- 
ble values for attributes provided by the various components of the query, thus making it possible to 
compute tight bounds on the sensitivity. 

1 Introduction 

Differential privacy [6] is a recent approach addressing the privacy of individuals in data analysis
on statistical databases. In general, statistical databases are designed to collect global information in 
some domain of interest, while the information about the particular entries is supposed to be kept con- 
fidential. Unfortunately, querying a database might leak information about an individual, because the 
presence of her record may induce the query to return a different result. 

To illustrate the problem, consider for instance a database of people affected by a certain disease,
containing data such as age, height, etc. Usually the identity of the people present in the database is supposed
to be secret, but if we are allowed to query the database for the number of records it contains,
and for (say) the average value of the data (height, age, etc.), then by comparing the answers obtained
before and after an insertion one can infer the precise data of the last person entered in the database,
which poses a serious threat to the disclosure of her identity as well.

To avoid this problem, one of the most commonly used methods consists in introducing some noise on 
the answer. In other words, instead of giving the exact answer the curator gives an approximated answer, 
chosen randomly according to some probability distribution. 

Differential privacy measures the level of privacy provided by such a randomized mechanism by a
parameter $\epsilon$: a mechanism $\mathcal{K}$ is $\epsilon$-differentially private if for every pair of adjacent databases $R$ and $R'$ (i.e.
databases which differ in only one entry), and for every property $\mathcal{P}$, the probabilities that $\mathcal{K}(R)$
and $\mathcal{K}(R')$ satisfy $\mathcal{P}$ differ at most by the multiplicative constant $e^{\epsilon}$.

The amount of noise that the mechanism must introduce in order to achieve $\epsilon$-differential privacy depends
on the so-called sensitivity of the query, namely the maximum distance between the answers on two
adjacent databases. For instance, one of the most commonly used mechanisms, the Laplacian, adds noise
to the correct answer $y$ by reporting an approximated answer $z$ according to the following probability
density function:

$$P_y(z) = c\, e^{-\frac{|z - y|}{\Delta_f}\,\epsilon}$$

where $\Delta_f$ is the sensitivity of the query $f$, and $c$ is a normalization factor. Clearly, the higher the
sensitivity, the greater the noise, in the sense that the above function is more "flat", i.e. we get a higher
probability of reporting an answer very different from the exact one.
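The calibration of the noise to the sensitivity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the density above is a Laplace distribution centred on the exact answer with scale $b = \Delta_f / \epsilon$, sampled as the difference of two exponential variables.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of the Laplace noise: b = sensitivity / epsilon."""
    if sensitivity < 0 or epsilon <= 0:
        raise ValueError("need sensitivity >= 0 and epsilon > 0")
    return sensitivity / epsilon


def laplace_mechanism(answer: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Return `answer` perturbed with Laplace noise of scale sensitivity/epsilon."""
    b = laplace_scale(sensitivity, epsilon)
    if b == 0:
        return answer  # a constant query needs no noise
    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return answer + noise
```

Doubling the sensitivity doubles the scale, and therefore the expected absolute error of the reported answer, which is why a tighter sensitivity bound translates directly into better utility.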

Of course, there is a trade off between the privacy and the utility of a mechanism: the more noise a 
mechanism adds, the less precise the reported answer, which usually means that the result of querying 
the database becomes less useful - whatever the purpose. 

For this reason, it is important to avoid adding excessive noise: one should add only the noise strictly 
necessary to achieve the desired level of differential privacy. This means that the sensitivity of the query 
should be computed as precisely as possible. At the same time, for the sake of efficiency it is desirable 
that the computation of the sensitivity is done statically. Usually this implies that we cannot compute the 
precise sensitivity, but only approximate it from above. The goal of this paper is to explore a constraint- 
based methodology in order to compute strict upper bounds on the sensitivity. 

The language we chose to conduct our analysis is relational algebra [1], a formal and well defined
model for relational databases, that is the basis for the popular Structured Query Language (SQL, [2]).
It consists in a collection of few operators that take relations as input and return relations as output, 
manipulating rows or columns and computing aggregation of values. 

Sensitivity on aggregations often depends on attribute ranges, and these restrictions can be exploited 
to provide better bounds. To this purpose, we extend mechanisms already in place in modern database
systems: in RDBMS (Relational Database Management Systems) implementations, during the creation
of a relation, it is possible to define a set of constraints over the attributes of the relation, to further restrict
the type information. For instance:

$$Persons(\{(Name, String), (Age, Integer)\}, \{(Age > 0 \land Age < 120)\})$$

refines the type integer used to express the age of a person in the database, by establishing that it must 
be a positive value smaller than 120. 

Constraints in RDBMS can be defined on single attributes (column constraints), or on several attributes 
(table constraints), and help define the structure of the relation, for example by stating whether an at- 
tribute is a primary key or a reference to an external key. In addition, so called check constraints can be 
defined, to verify the insertion of correct values. In the example above, for instance, the constraint would 
avoid inserting an age of, say, 200. Check constraints are particularly useful for our purposes because 
they restrict the possible values of the attributes, thus allowing a finer analysis of the sensitivity. 
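As a small illustration of a check constraint in practice, the sketch below declares the Persons relation in an in-memory SQLite database through Python's standard `sqlite3` module. The helper names `make_db` and `insert_age` are hypothetical, and the SQL types are an assumption about how the schema above would be declared.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding a Persons table with a check constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Persons ("
        " Name TEXT,"
        " Age INTEGER CHECK (Age > 0 AND Age < 120))"
    )
    return conn


def insert_age(conn: sqlite3.Connection, name: str, age: int) -> bool:
    """Try to insert a row; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Persons (Name, Age) VALUES (?, ?)", (name, age))
        return True
    except sqlite3.IntegrityError:
        return False
```

An insertion with age 200 is rejected by the database, so every stored age is known statically to lie strictly between 0 and 120.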

Contribution Our contribution is twofold: 

1. we propose a method to compute a bound on the sensitivity of a query in relational algebra in a
compositional way, and 

2. we propose the use of constraints and constraint solvers to refine the method and obtain strict 
bounds on queries which have aggregation functions at the top level. 

Plan of the paper The next section recalls some preliminary notions about relational databases and
differential privacy. Section 3 introduces a constraint system and the idea of carrying along the information
provided by the constraints as we analyze the query. Section 4 proposes a generalization of differential
privacy and sensitivity to generic metric spaces. This generalization will be useful in order to compute
the sensitivity of a query in a compositional way. Sections 6, 7 and 8 analyze the sensitivity and the
propagation of constraints for the various operators of relational algebra. Finally, Section 9 proposes a method
to compute a sensitivity bound on the global query, and shows its correctness and the improvement
provided by the use of constraints. Section 10 discusses some related work, and Section 11 concludes. Due
to space limitations, in this version we have omitted several proofs. The interested reader can find them
in the full online version of the paper [3].



2 Preliminaries 

We recall here some basic notions about relational databases and relational algebra, differential privacy, 
and sensitivity. 

2.1 Relational Databases and Relational Algebra 

Relational algebra [1, 2] can be considered as the theoretic foundation of database query languages and
in particular of SQL [2]. It is based on the concept of relation, which is the mathematical essence of
a (relational) database, and of certain operators on relations like union, intersection, projections, filters,
etc. Here we recall the basic terminology used for relational databases, while the operators will be
illustrated in detail in the technical body of the paper.

A relation (or database) based on a certain schema is a collection of tuples (or records) of values. The 
schema defines the types (domain) and the names (attributes) of these values. 

Definition 1 (Relation Schema). A relation schema $r(a_1 : D_1, a_2 : D_2, \ldots, a_n : D_n)$ is composed of the
relation name $r$ and a set of attributes $a_1, a_2, \ldots, a_n$ associated with the domains $D_1, D_2, \ldots, D_n$,
respectively. We use the notation $dom(a_i)$ to refer to $D_i$.

Definition 2 (Relation). A relation $R$ on a relation schema $r(a_1 : D_1, a_2 : D_2, \ldots, a_n : D_n)$ is a subset of
the Cartesian product $D_1 \times D_2 \times \ldots \times D_n$.

A relation is thus composed of a set of n-tuples, where each n-tuple $\tau$ has the form $(d_1, d_2, \ldots, d_n)$ with
$d_i \in D_i$. Note that $\tau$ can also be seen as a partial function from attributes to atomic values, i.e. $\tau(a_i) = d_i$.
Given a schema, we will denote the universe of possible tuples by $\mathcal{T}$, and the set of all possible relations
by $\mathcal{R} = \mathcal{P}(\mathcal{T})$.

Relational algebra is a language that operates from relations to relations. Differentially private queries,
however, can only return a value, and for this reason they must end with an aggregation (operator $\gamma$).
Nevertheless it is possible to show that the full power of relational algebra aggregation can be retrieved.

2.2 Differential Privacy 

Differential privacy is a property meant to guarantee that the participation in a database does not con- 
stitute a threat for the privacy of an individual. More precisely, the idea is that a (randomized) query 
satisfies differential privacy if two relations that differ only for the addition of one record are almost 
indistinguishable with respect to the results of the query. 

Two relations $R, R' \in \mathcal{R}$ that differ only for the addition of one record are called adjacent, denoted by
$R \sim R'$. Formally, $R \sim R'$ iff $R \setminus R' = \{\tau\}$ or vice versa $R' \setminus R = \{\tau\}$, where $\tau$ is a tuple.

Definition 3 (Differential privacy [6]). A randomized function $\mathcal{K} : \mathcal{R} \to Z$ satisfies $\epsilon$-differential
privacy if for all pairs $R, R' \in \mathcal{R}$ with $R \sim R'$, and all $Y \subseteq Z$, we have that:

$$Pr[\mathcal{K}(R) \in Y] \leq Pr[\mathcal{K}(R') \in Y] \cdot e^{\epsilon}$$

where Pr[E] represents the probability of the event E. 

Differentially private mechanisms are usually obtained by adding some random noise to the result of 
the query. The best results are obtained by calibrating the noise distribution according to the so-called 
sensitivity of the query. When the answers to the query are real numbers ($\mathbb{R}$), its sensitivity is defined as
follows. (We represent a query as a function from databases to the domain of answers.)

Definition 4 (Sensitivity [6]). Given a query $Q : \mathcal{R} \to \mathbb{R}$, the sensitivity of $Q$, denoted by $\Delta_Q$, is defined
as:

$$\Delta_Q = \sup_{R \sim R'} |Q(R) - Q(R')|$$

The above definition can be extended to queries with answers on generic domains, provided that they are 
equipped with a notion of distance. 
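For a finite universe of tuples the supremum in the definition of sensitivity can be computed by exhaustive enumeration. The sketch below is only an illustration under that assumption (real domains are usually too large to enumerate); the helper names are hypothetical.

```python
from itertools import combinations


def all_relations(universe):
    """Every relation (set of tuples) that can be built from `universe`."""
    items = list(universe)
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)


def sensitivity(query, universe):
    """sup |Q(R) - Q(R')| over adjacent R, R' (R' is R plus one record)."""
    worst = 0.0
    for r in all_relations(universe):
        for t in universe:
            if t not in r:
                worst = max(worst, abs(query(r | {t}) - query(r)))
    return worst
```

On a universe of single-attribute tuples with values -3, 1 and 5, counting has sensitivity 1 while summing has sensitivity 5, the largest absolute value a single record can contribute.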

3 Databases with constraints 

As explained in the introduction, one of the contributions of our paper is to provide strict bounds on the 
sensitivity of queries by using constraints. For an introduction to the notions of constraint, constraint 
solver, and constraint system we refer to [JJ. 

In this section we define the constraint system that we will use, and we extend the notion of database
schema so as to accommodate the additional information provided by the constraints during the analysis of
a query.

Definition 5 (Constraint system). Our constraint system is defined as follows: 

• Terms are constructed from: 

— variables, ranging over the attribute names of the schemas, 

— constants, ranging over the domains of the schemas, 

— applications of n-ary functions (e.g. +,x) to n terms. 

• Atoms are applications of n-ary predicates to n terms. Possible predicates are $>$, $<$, $=$, $\in$.

• Constraints are constructed from: 

— atoms, and 

— applications of logical operators ($\neg$, $\land$, $\lor$, $\equiv$) to constraints.

We denote the composition of constraints by $\otimes$. The solutions of a set of constraints $C$ is the set of tuples
that satisfy $C$, denoted $sol(C)$. The relations that can be built from $sol(C)$ are denoted by $\mathcal{R}(C) =
\mathcal{P}(sol(C))$. The solutions with respect to an attribute $a$ are denoted $sol(C, a)$. Namely, $sol(C, a)$ is the
projection on $a$ of $sol(C)$. When the domain is equipped with an ordering relation, we also use $inf(C, a)$
and $sup(C, a)$ to denote the infimum and the supremum values, respectively, of $sol(C, a)$. Typically the
solutions and the inf and sup values can be computed automatically using constraint solvers. Finally we
define the diameter of a constraint $C$ as the maximum distance between the solutions of $C$.

Definition 6 (Diameter). The diameter of a constraint $C$, denoted $diam(C)$, is the graph diameter of the
adjacency graph $(\mathcal{R}(C), \sim)$ of all possible relations composed by tuples that satisfy $C$.

We now extend the classical definition of schema to contain also the set of constraints.

Definition 7 (Constrained schema). A constrained schema $r(A, C)$ is composed of the relation name $r$,
a set of attributes $A$, and a set of constraints $C$. A relation on a constrained schema is a subset of $sol(C)$.
We will use $schema(R)$ to represent the constrained relation schema of a relation $R$.

The above definition extends the notion of relation schema (Definition 1): in fact here each $a_i$ can be
seen as associated with $sol(C, a_i)$. Definition 1 can then be retrieved by imposing as only constraints
those of the form $a_i \in D_i$.

Example 1. Consider the constrained schema Items(A,C), where A = {Item, Price, Cost}, and C =
{(Cost < Price < 1000, 0 < Cost < 1000)}. The following R is a possible relation over this schema.

    Items(A,C):   A = {Item, Price, Cost}
                  C = {(Cost < Price < 1000, 0 < Cost < 1000)}

    R:   Item   Price   Cost
         Oil    100     10
         Salt   50      11
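The notions of solution set, infimum, supremum and diameter can be made concrete by brute force on a small finite domain. The sketch below is an illustration only: a real implementation would delegate to a constraint solver, the helper names are hypothetical, and the Price and Cost domains are shrunk to {0, ..., 5} so that enumeration stays tiny. The diameter relies on the observation that, for a finite solution set, the two farthest relations are the empty relation and the solution set itself.

```python
from itertools import product


def sol(domains, constraint):
    """All tuples over the finite `domains` (attribute -> values) that satisfy `constraint`."""
    names = list(domains)
    solutions = []
    for values in product(*(domains[n] for n in names)):
        candidate = dict(zip(names, values))
        if constraint(candidate):
            solutions.append(candidate)
    return solutions


def sol_attr(domains, constraint, attr):
    """Projection of the solution set on a single attribute."""
    return {t[attr] for t in sol(domains, constraint)}


def inf(domains, constraint, attr):
    return min(sol_attr(domains, constraint, attr))


def sup(domains, constraint, attr):
    return max(sol_attr(domains, constraint, attr))


def diam(domains, constraint):
    """Diameter of the adjacency graph over subsets of a finite solution set."""
    return len(sol(domains, constraint))


def items_constraint(t, bound=1000):
    """The constraint of the Items schema, restricted to Price and Cost."""
    return t["Cost"] < t["Price"] < bound and 0 < t["Cost"] < bound


# Hypothetical shrunken domains, used only to keep the enumeration small.
SMALL_DOMAINS = {"Price": range(0, 6), "Cost": range(0, 6)}
```

Both rows of the relation R above, (Price 100, Cost 10) and (Price 50, Cost 11), satisfy `items_constraint`, so R is indeed a relation on the constrained schema.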



4 Differential privacy on arbitrary metrics 

The classic notions of differential privacy and sensitivity are meant for queries defined on the set of all 
relations on a given schema. The adjacency relation induces a graph structure (where the arcs correspond 
to the adjacency relation), and a metric structure (where the distance is defined as the distance on the 
graph). 

In order to compute the sensitivity bounds in a compositional way, we need to cope with different struc- 
tures at the intermediate steps, and with different notions of distance. Consequently, we need to extend 
the notions of differential privacy and sensitivity to general metric domains. 
We start by defining the notions of distance that we will need. 

Definition 8 (Hamming distance $d_H$). The distance between two relations $R, R' \in \mathcal{R}$ is the Hamming
distance $d_H(R, R') = |R \ominus R'|$, the cardinality of the symmetric difference between $R$ and $R'$. The
symmetric difference is defined as $R \ominus R' = (R \setminus R') \cup (R' \setminus R)$.

Note that $d_H$ coincides with the graph-theoretic distance on the graph induced by the adjacency relation
$\sim$, and that $d_H(R, R') = 1 \Leftrightarrow R \sim R'$. We now extend the Hamming distance to tuples of relations, to deal
with n-ary operators.

Definition 9 (Distance $d_{nH}$). The distance $d_{nH}$ between two tuples of n relations $(R_1, \ldots, R_n)$,
$(R'_1, \ldots, R'_n) \in \mathcal{R}^n$ is defined as:

$$d_{nH}((R_1, \ldots, R_n), (R'_1, \ldots, R'_n)) = \max(d_H(R_1, R'_1), \ldots, d_H(R_n, R'_n))$$

Note that $d_{nH}$ coincides with the Hamming distance for $n = 1$. We chose this maximum metric instead of
other distances because it allows us to compute the sensitivity compositionally, while this is not the case
for other notions of distance. We can show counterexamples, for instance, for both the Euclidean and the
Manhattan distances.

Definition 10 (Distance $d_E$). The distance between two real numbers $x, x'$ is the usual Euclidean
distance $d_E(x, x') = |x - x'|$.

In summary, we have two metric spaces over which the relational algebra operators work, namely
$(\mathcal{R}^n, d_{nH})$ and $(\mathbb{R}, d_E)$.

Example 2. Consider a relation $R$ and two tuples $\tau, \kappa$ such that $\tau \notin R$ and $\kappa \in R$. We define its neighbors
$R^+$ and $R^{\pm}$ obtained by adding one record, and by changing one record, respectively:

$$R^+ = R \cup \{\tau\} \qquad R^{\pm} = R \cup \{\tau\} \setminus \{\kappa\}$$

Their distance from $R$ is: $d_H(R, R^+) = |R \ominus R^+| = 1$, and $d_H(R, R^{\pm}) = |R \ominus R^{\pm}| = 2$. Note also that
$R \sim R^+$.
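The two distances on relations map directly onto set operations. The following sketch represents relations as Python sets of tuples; the function names and the sample records are hypothetical and only mirror the situation of Example 2.

```python
def d_h(r, r_prime):
    """Hamming distance: size of the symmetric difference of two relations."""
    return len(set(r) ^ set(r_prime))


def d_nh(rs, rs_prime):
    """Maximum of the component-wise Hamming distances between two tuples of relations."""
    if len(rs) != len(rs_prime):
        raise ValueError("both tuples must contain the same number of relations")
    return max(d_h(r, rp) for r, rp in zip(rs, rs_prime))


def adjacent(r, r_prime):
    """Two relations are adjacent iff they differ by exactly one record."""
    return d_h(r, r_prime) == 1


# Example 2 on sample data: kappa is in R, tau is not.
R = {("John", 30), ("Tim", 10)}
tau, kappa = ("Alice", 45), ("Tim", 10)
R_plus = R | {tau}                  # one record added
R_changed = (R | {tau}) - {kappa}   # one record replaced by another
```

Replacing a record counts as two steps (one removal and one addition), which is why only `R_plus` is adjacent to `R`.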

Notation 1. In the following, we will use the notation $R^+$ to denote $R \cup \{\tau\}$ for a generic tuple $\tau$, with
the assumption (unless otherwise specified) that $\tau \notin R$.

We now adapt the definition of differential privacy to arbitrary metric spaces {X,d) (where X is the 
support set and d the distance function). 

Definition 11 (Differential privacy extended). A randomized mechanism $\mathcal{K} : X \to Z$ on a generic
metric space $(X, d)$ provides $\epsilon$-differential privacy if for any $x, x' \in X$, and any set of possible outputs $Y \subseteq Z$,

$$Pr[\mathcal{K}(x) \in Y] \leq Pr[\mathcal{K}(x') \in Y] \cdot e^{\epsilon \cdot d(x, x')}$$

It can easily be shown that Definitions 11 and 3 are equivalent if $d = d_H$.
We now define the sensitivity of a function on a generic metric space.

Definition 12 (Sensitivity extended). Let $(X, d_X)$ and $(Y, d_Y)$ be metric spaces. The sensitivity $\Delta_f$ of a
function $f : (X, d_X) \to (Y, d_Y)$ is defined as

$$\Delta_f = \sup_{x, x' \in X,\ x \neq x'} \frac{d_Y(f(x), f(x'))}{d_X(x, x')}$$

Again, we can show that Definitions 12 and 4 are equivalent if $d_X = d_H$ (proof in full version [3]).
This more general definition makes clear that the sensitivity of a function is a measure of how much it
increases distances from its inputs to its outputs.

As a refinement of the definition of sensitivity, we may notice that the sensitivity does not depend on the
function alone, but also on the domain over which $x, x'$ range when computing the supremum. In our
framework this is particularly useful because we have a very precise description of the restrictions on the
domain of an operator, thanks to its input constrained schema (Definition 7).

Definition 13 (Sensitivity constrained). Given a function $f : (X, d_X) \to (Y, d_Y)$, and a set of constraints
$C$ on $X$, the sensitivity of $f$ with respect to $C$ is defined as

$$\Delta_f(C) = \sup_{x, x' \in sol(C),\ x \neq x'} \frac{d_Y(f(x), f(x'))}{d_X(x, x')}$$
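The effect of restricting the domain on the supremum can be seen on a toy function. The sketch below is an illustration under the assumption of a small finite set of points; the function names are hypothetical and the squaring function is chosen only because its distance ratio is easy to check by hand.

```python
def constrained_sensitivity(f, points, d_x, d_y, constraint=lambda x: True):
    """sup of d_y(f(x), f(x')) / d_x(x, x') over distinct x, x' satisfying `constraint`."""
    allowed = [x for x in points if constraint(x)]
    ratios = [
        d_y(f(x), f(xp)) / d_x(x, xp)
        for x in allowed
        for xp in allowed
        if x != xp
    ]
    return max(ratios, default=0.0)


def absolute(a, b):
    """Euclidean distance on numbers."""
    return abs(a - b)


def square(x):
    return x * x
```

On the integers 0 to 10 the ratio for `square` is $x + x'$, so the unconstrained supremum is 19, while the constraint $x \leq 3$ lowers it to 5.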



The introduction of constraints, in addition to an improved precision, allows us to define function
composition conveniently. It should be noted that when combining two functions $g \circ f$, where $g : (Y, d_Y) \to
(Z, d_Z)$, the domain of $g$ actually depends on the restrictions introduced by $f$, and we can take this into
account by maximizing over $y, y' \in sol(C \otimes C_f)$, that is the domain obtained combining the initial constraint
$C$ and the constraint introduced by $f$.

5 Operators

We now proceed to compute a bound on the sensitivity of each relational algebra operator through a static 
analysis that depends only on the relation schema the operator is applied to, and not on its particular 
instances. 

From a static point of view each operator will be considered as a transformation from schema to schema 
(instead of a transformation from relations to relations): they may add or remove attributes, and modify 
constraints. 

The following analysis is split in operators $op : (\mathcal{R}^n, d_{nH}) \to (\mathcal{R}, d_H)$, with $n$ equal to 1 or 2, and
aggregation $\gamma_f : (\mathcal{R}, d_H) \to (\mathbb{R}, d_E)$. In the sensitivity analysis of the former, given that they work only on Hamming
metrics, we are only interested in their effect on the number of rows. In our particular case, these
relational algebra operators treat all rows equally, without considering their content. This simplification
grants us the following property:

Proposition 1. If $op : (\mathcal{R}, d_H) \to (\mathcal{R}, d_H)$ and $C$ is an arbitrary set of constraints, then

$$\Delta_{op}(C) = \sup_{R, R' \in \mathcal{R}(C),\ R \neq R'} \frac{d_H(op(R), op(R'))}{d_H(R, R')} = \min(\Delta_{op}(\emptyset), diam(C \otimes C_{op}))$$

(The proposition holds analogously for the binary case.) This property, which does not hold for general
functions, allows us in the case of relational algebra to decouple the computation of sensitivity from the
constraint system, and solve them separately. $\Delta_{op}(\emptyset)$ (from now on just $\Delta_{op}$) can be seen as the sensitivity
intrinsic to each operator, the maximum value of sensitivity the operator can cause, when the constraints
are loose enough to be omitted. In turn $diam(C \otimes C_{op})$, the diameter of the co-domain of the operator,
limits the maximum distance the operator can produce, that is the numerator in the distances ratio.



6 Row operators 

In this section we consider a first group of operators of type $(\mathcal{R}^n, d_{nH}) \to (\mathcal{R}, d_H)$ with $n = 1, 2$, which
are characterized by the fact that they can only add or remove tuples, not modify their attributes. Indeed
the header of the resulting relation maintains the same set of attributes and only the relative constraints
may be modified.

6.1 Union $\cup$

The union of two relations is the set theoretic union of two sets of tuples with the same attributes. The
example below illustrates this operation:

    Name   Age   Height         Name    Age   Height         Name    Age   Height
    John   30    180       ∪    Alice   45    160       =    John    30    180
    Tim    10    100            Tim     10    100            Tim     10    100
                                                             Alice   45    160

The union of two relations may reduce their distance, leave it unchanged or in the worst case it could
double it, so the sensitivity of union is 2.

Proposition 2. The union has sensitivity 2: $\Delta_\cup = 2$.

for all possible domains the sensitivity can't be greater.

Proof. If $d_{2H}((R_1, R_2), (R_3, R_4)) = 1$ then we have two cases:

a) $R_3 = R_1^+, R_4 = R_2$ or $R_3 = R_1, R_4 = R_2^+$. By the symmetry of distance only one case needs to be
considered:

$$|(R_1 \cup R_2) \ominus (R_1^+ \cup R_2)| = \begin{cases} 0 & \tau \in R_2 \\ 1 & \text{o.w.} \end{cases}$$

The only difference is the tuple $\tau$. If $\tau \in R_2$ then $\tau$ would be in both results, leading to identical
relations, thus reducing the distance to zero. If $\tau \notin R_2$ then $\tau$ will again be the only difference
between the results, thus resulting in distance 1.

b) $R_3 = R_1^+, R_4 = R_2^+$, where $R_1^+ = R_1 \cup \{\tau_1\}$ and $R_2^+ = R_2 \cup \{\tau_2\}$:

$$|(R_1 \cup R_2) \ominus (R_1^+ \cup R_2^+)| = \begin{cases} 0 & \tau_1 \in R_2 \land \tau_2 \in R_1 \\ 1 & \tau_1 \in R_2 \lor \tau_2 \in R_1 \\ 2 & \tau_1 \notin R_2 \land \tau_2 \notin R_1 \end{cases}$$

In this case we have two records differing, $\tau_1$ and $\tau_2$, and in the worst case they may remain different in
the results, giving a final sensitivity of 2 for the operator. □

Definition 14 (Constraints for union). Let $schema(R_1) = (A, C_1)$ and $schema(R_2) = (A, C_2)$. Then
$schema(R_1 \cup R_2) = (A, C_1 \lor C_2)$.

6.2 Intersection $\cap$

The intersection of two relations is the set theoretic intersection of two sets of tuples with the same
attributes.

As for the union, the intersection applied to arguments at distance 1 may result in a distance 0, 1 or 2.

Proposition 3. The intersection has sensitivity 2: $\Delta_\cap = 2$.

Proof. Similar to the case of Proposition 2. □

Definition 15 (Constraints for intersection). Let $schema(R_1) = (A, C_1)$ and $schema(R_2) = (A, C_2)$.
Then $schema(R_1 \cap R_2) = (A, C_1 \land C_2)$.

6.3 Difference $\setminus$

The difference of two relations is the set theoretic difference of two sets of tuples with the same attributes.
As in the case of the union, the difference applied to arguments at distance 1 may result in a distance 0,
1 or 2.

Proposition 4. The set difference has sensitivity 2: $\Delta_\setminus = 2$.

Proof. Similar to the case of Proposition 2. □
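The sensitivity of the three binary row operators can be checked exhaustively on a tiny universe. The sketch below is only a sanity check under the assumption of a three-tuple universe with hypothetical names; it is not a replacement for the proofs.

```python
from itertools import combinations

UNIVERSE = ("t1", "t2", "t3")  # hypothetical three-tuple universe


def relations(universe):
    """All relations (as frozensets) over the given universe."""
    for size in range(len(universe) + 1):
        for subset in combinations(universe, size):
            yield frozenset(subset)


def d_h(r, rp):
    """Hamming distance between two relations."""
    return len(r ^ rp)


def binary_sensitivity(op, universe=UNIVERSE):
    """sup of d_H(op(R1,R2), op(R3,R4)) / d_2H((R1,R2),(R3,R4)) over distinct inputs."""
    rels = list(relations(universe))
    worst = 0.0
    for r1 in rels:
        for r2 in rels:
            for r3 in rels:
                for r4 in rels:
                    d_in = max(d_h(r1, r3), d_h(r2, r4))
                    if d_in > 0:
                        worst = max(worst, d_h(op(r1, r2), op(r3, r4)) / d_in)
    return worst
```

Calling `binary_sensitivity` with `frozenset.union`, `frozenset.intersection` or `frozenset.difference` enumerates 4096 pairs of inputs each time, which is instantaneous at this size.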

Definition 16 (Constraints for set difference). Let $schema(R_1) = (A, C_1)$ and $schema(R_2) = (A, C_2)$.
Then $schema(R_1 \setminus R_2) = (A, C_1 \land (\neg C_2))$.

6.4 Restriction $\sigma$

The restriction operator $\sigma_\varphi(R)$ removes all rows not satisfying the condition $\varphi$ (typically constructed
using the predicates $=, \neq, <, >$ and the logical connectives $\lor, \land, \neg$), over a subset of the attributes of $R$.

As an example, consider the following SQL program that removes all people whose age is smaller than 20
or whose height is greater than 180. The table illustrates an example of application of the corresponding
restriction $\sigma_{Age \geq 20 \land Height \leq 180}$.

    SELECT *
    FROM R
    WHERE Age >= 20 AND Height <= 180

    Argument of $\sigma_{Age \geq 20 \land Height \leq 180}$:        Result:

    Name      Age   Height                 Name      Age   Height
    John      30    180                    Alice     45    160
    Tim       10    100                    Natalie   20    175
    Alice     45    160
    Natalie   20    175

The restriction can be expressed in terms of set difference: $\sigma_\varphi(R) = R \setminus \{\tau \mid \neg\varphi(\tau)\}$. However the
sensitivity is different because the operator is unary: the second argument is fixed by the condition $\varphi$.

Proposition 5. The restriction has sensitivity 1: $\Delta_\sigma = 1$.

Definition 17 (Constraints for restriction). Let $schema(R) = (A, C)$ and $A' \subseteq A$. Then define
$schema(\sigma_{\varphi(A')}(R)) = (A, C \land \varphi(A'))$.



7 Attribute operators 

The following set of operators, unlike those analyzed so far, can affect the number of tuples of a relation, 
as well as its attributes. 



7.1 Projection $\pi$

The projection operator $\pi_{a_1, \ldots, a_n}(R)$ eliminates the columns of $R$ with attributes other than $a_1, \ldots, a_n$, and
then deletes possible duplicates, thus reducing distances or leaving them unchanged. It is the opposite of
the restricted Cartesian product $\times_1$ which will be presented later.

The following example illustrates the use of the projection. Here, the attributes to preserve are Name and
Age.

    SELECT Name, Age
    FROM R

    Argument of $\pi_{Name, Age}$:               Result:

    Name    Age   Car                 Name    Age
    John    30    Ford                John    30
    John    30    Renault             Alice   45
    Alice   45    Fiat


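Proposition 6. The projection has sensitivity 1: $\Delta_\pi = 1$.

Proof.

$$|\pi_{a_1, \ldots, a_n}(R) \ominus \pi_{a_1, \ldots, a_n}(R^+)| = \begin{cases} 0 & \exists \rho \in R\ \forall i \in \{1, \ldots, n\}\ \rho(a_i) = \tau(a_i) \\ 1 & \text{o.w.} \end{cases}$$

□

Definition 18 (Constraints for projection). Let $schema(R) = (A, C)$ and $A' \subseteq A$. Then
$schema(\pi_{A'}(R)) = (A', C)$.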
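The behavior of the projection on distances can be illustrated on the example relation. The sketch below represents a relation as a Python set of tuples together with a header; the function name `project` and the extra records used to probe it are hypothetical.

```python
def project(relation, header, keep):
    """Keep the named columns of `relation` and drop duplicate rows."""
    idx = [header.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in relation}


HEADER = ("Name", "Age", "Car")
R = {("John", 30, "Ford"), ("John", 30, "Renault"), ("Alice", 45, "Fiat")}
```

Adding a record that agrees with an existing one on Name and Age leaves the projection unchanged (distance 0), while adding a record with a new Name and Age pair changes it by exactly one row (distance 1).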



C. Palamidessi & M. Stronati 






7.2 Cartesian product 

The Cartesian product of two relations is the set theoretic Cartesian product of two sets of tuples with
different attributes, with the exception that in relations the order of attributes does not count, thus making
the operation commutative. The following example illustrates this operation.

    Name    Age   Height         Car    Owner         Name    Age   Height   Car    Owner
    John    30    180       ×    Fiat   Alice    =    John    30    180      Fiat   Alice
    Alice   45    160            Ford   Alice         John    30    180      Ford   Alice
                                                      Alice   45    160      Fiat   Alice
                                                      Alice   45    160      Ford   Alice

This operator may seem odd in the context of a query language, but it is in fact the base of the join, the
operator that merges the information of two relations.

$$R \bowtie_{R.a_i = T.a_i} T = \sigma_{R.a_i = T.a_i}(R \times T)$$

We now analyze the sensitivity of the Cartesian product.

One record $\times_1$. We first consider a restricted version $\times_1$, where on one side we have a single tuple.

Proposition 7. The operator $\times_1$ has sensitivity 1: $\Delta_{\times_1} = 1$.

N records $\times$. We consider now the full Cartesian product operator. It is immediate to see that a
difference of a single row can be expanded to an arbitrary number of records, thus causing an unbounded
sensitivity.

Proposition 8. The (unrestricted) Cartesian product has unbounded sensitivity.

We now define how constraints propagate through Cartesian product:

Definition 19 (Constraints for product). Let $schema(R_1) = (A_1, C_1)$ and $schema(R_2) = (A_2, C_2)$. Then
$schema(R_1 \times R_2) = (A_1 \cup A_2, C_1 \land C_2)$.
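The contrast between the restricted product with a single tuple and the full product can be observed directly. In the sketch below relations are sets of tuples and rows are concatenated; the helper names are hypothetical.

```python
from itertools import product


def cartesian(r1, r2):
    """Concatenate every row of r1 with every row of r2."""
    return {a + b for a, b in product(r1, r2)}


def growth_after_one_insert(r1, r2, new_row):
    """Hamming distance between r1 x r2 and (r1 plus new_row) x r2."""
    before = cartesian(r1, r2)
    after = cartesian(set(r1) | {new_row}, r2)
    return len(before ^ after)
```

Adding one row to the left argument changes the result by as many rows as the right argument contains: one row when the right side is a single tuple, and arbitrarily many otherwise, which is the source of the unbounded sensitivity.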

7.3 Restricted $\times$

The effect of Cartesian product is to expand each record with a block of records, a behavior clearly against
our objective of distance-preserving computations. However we propose some restricted versions of the
operator in order to maintain its functionality to a certain extent:

• $\times_n$: product with blocks of a fixed size $n$, to obtain sensitivity $n$. In this case $n$ representative
elements can be chosen from the relation; the definition of policies to pick these elements is left to
future developments.

• $\times_\gamma$: a new single record is built as an aggregation of the relation, through the operator $_\emptyset\gamma_f$
(presented later), thus falling in the case of $\times_1$ sensitivity.

• a mix of the two approaches could be considered, building $n$ aggregations, possibly using the operator
$_{\{a_i\}}\gamma_f$ (presented later).

In both approaches the rest of the query can help to select the right records from the block, for example
an external restriction could be anticipated.

8 Aggregation $\gamma$

The classical relational algebra operator for aggregation $_{\{a_1, \ldots, a_n\}}\gamma_{\{f_1, \ldots, f_k\}}(R)$ performs the following
steps:

• it partitions $R$, so that each group has all the tuples with the same values for each $a_i$,

• it computes all $f_i$ for each group,

• it returns a single tuple for each group, with the values of $a_i$ and of $f_i$.

The most common functions found in RDBMS are count, max, min, avg, sum and we will restrict our
analysis to these ones. The following example illustrates how we can use an aggregation operator to
know, for each type of Car, how many people own it and what is their average height.

    SELECT Car, Count(*), Avg(Height)
    FROM R
    GROUP BY Car

    Argument of $_{\{Car\}}\gamma_{\{Count, Avg(Height)\}}$:            Result:

    Name      Age   Height   Car               Car       Count   Avg(Height)
    Alice     45    160      Ford              Ford      2       165
    John      30    180      Fiat              Fiat      1       180
    Frank     45    165      Renault           Renault   1       165
    Natalie   20    170      Ford
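The three steps of the grouped aggregation (partition, compute, return one tuple per group) can be sketched on the example relation. The function name is hypothetical and the rows are assumed to follow the (Name, Age, Height, Car) layout of the table.

```python
from collections import defaultdict


def count_and_avg_height_by_car(rows):
    """Group (name, age, height, car) rows by car; return car -> (count, average height)."""
    heights = defaultdict(list)
    for _name, _age, height, car in rows:       # step 1: partition by Car
        heights[car].append(height)
    return {                                     # steps 2 and 3: one result per group
        car: (len(hs), sum(hs) / len(hs))
        for car, hs in heights.items()
    }


R = [
    ("Alice", 45, 160, "Ford"),
    ("John", 30, 180, "Fiat"),
    ("Frank", 45, 165, "Renault"),
    ("Natalie", 20, 170, "Ford"),
]
```

The result contains one entry per distinct car, so the size of the output depends on the data, which is why a grouped aggregation cannot statically guarantee a single returned value.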



In the domain of differential privacy special care must be taken when dealing with this operator as it is
in fact the point of the query in which our analysis of sensitivity ends and the noise must be added to the
result of the function application.

A differentially private query should return a single value, in our case in $\mathbb{R}$, and the only queries that
statically guarantee this property are those ending with the operator $_\emptyset\gamma_f : (\mathcal{R}, d_H) \to (\mathbb{R}, d_E)$ (from here
on abbreviated $\gamma_f$), that apply only one function $f$ to the whole relation without grouping. For this reason
we will ignore grouping for now, and focus on queries of the form $_\emptyset\gamma_f(Q)$ where $Q$ is a sub-query without
aggregations. It is however possible to recover the original $_{a}\gamma_f$ behavior and use it in sub-queries.



8.1 Functions 

In this section we analyze the sensitivity of the common mathematical functions count, sum, max, min
and avg. The application of functions coincides with the change of domain: in fact they take as input a
relation in $(\mathcal{R}, d_H)$ and return a single number in $(\mathbb{R}, d_E)$ (not to be confused with a relation with a single
tuple, which also contains a single value).

Extending standard results [6], we can prove that, when $f$ = count, sum, max, min, avg then $\Delta_f(C)$ can
be computed as follows:

Proposition 9.

$$\Delta_{max_{a_i}}(C) = |sup(C, a_i) - inf(C, a_i)|$$
$$\Delta_{min_{a_i}}(C) = |sup(C, a_i) - inf(C, a_i)|$$
$$\Delta_{count}(C) = 1$$
$$\Delta_{sum_{a_i}}(C) = \max\{|sup(C, a_i)|, |inf(C, a_i)|\}$$
$$\Delta_{avg_{a_i}}(C) = \frac{|sup(C, a_i) - inf(C, a_i)|}{2}$$
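Once the infimum and supremum of an attribute are known, the sensitivities of Proposition 9 are simple arithmetic. The sketch below assumes the proposition reads as: range width for max and min, 1 for count, largest absolute bound for sum, and half the range width for avg; the function name and the string tags are hypothetical.

```python
def aggregate_sensitivity(function, lo, hi):
    """Sensitivity of an aggregation on an attribute whose values range over [lo, hi]."""
    if lo > hi:
        raise ValueError("empty range")
    if function in ("max", "min"):
        return abs(hi - lo)
    if function == "count":
        return 1
    if function == "sum":
        return max(abs(hi), abs(lo))
    if function == "avg":
        return abs(hi - lo) / 2
    raise ValueError(f"unsupported function: {function}")
```

For an attribute ranging over [0, 150], this gives 150 for max, min and sum, 1 for count, and 75 for avg.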



8.2 Exploiting the constraint system

The sensitivity of aggregation functions, as shown above, depends on the range of the values of an
attribute, so clearly it is important to compute the range as accurately as possible.

The usual approach is to consider the bounds given by the domain of each attribute. In terms of constraint
system, this corresponds to considering the solutions of the constraint $C_I = a_1 \in D_1 \land a_2 \in D_2 \land \ldots \land a_n \in D_n$.
I.e. the standard approach computes the sensitivity of aggregation functions for an attribute $a$ on the basis
of $sup(C_I, a)$ and $inf(C_I, a)$.

In our proposal we also use $C_I$: for us it is the initial constraint, at the beginning of the analysis of the
query. The difference is that our approach updates this constraint with information provided by the
various components of the query, and then exploits this information to compute more accurate ranges for
each attribute. The following example illustrates the idea.

Example 3. Assume that schema{R) = {{Weight , Height} ,Ci), andthat the domain for Weight is [0, 150] 
and for Height is [0,200]. The following query asks the average weight of all the individuals whose 
weight is below the height minus 100. 

'Yavg{Weight) ( (^Weight <H eight - 1 00 (^) ) 

Below we show the initial constraint Cj and the constraints Cq computed by taking into account the 
condition of O. Compare the sensitivity computed using C/ with the one computed using Cq: They differ 
because in Cq the max value of Weight is 100, while in Ci is 150. 

Ci = {We [0, 150] AHe [0,200]} MQ, y,„^(iy)) = \'«^iCj,W)-min{Q,W)\ ^ 75 

Cq = {W e [0,150] A// £[0,200] AW<H-m} A(CQ,y^„g(ty)) = ''"'"^^^■'^^''"'"''^g-'^'l =50 

Hence exploiting the constraints generated by the query can lead to a significant reduction of the sensi- 
tivity. 
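The arithmetic of Example 3 can be reproduced with a small interval-tightening sketch. The helper names below are hypothetical, and the code handles only constraints of the form W < H - offset over real intervals; it is not meant as a general constraint solver.

```python
def tighten_less_than_offset(w, h, offset):
    """Ranges of W and H after adding the constraint W < H - offset.

    w and h are (inf, sup) pairs over the reals; returns the tightened pairs.
    """
    w_inf, w_sup = w
    h_inf, h_sup = h
    # No pair (W, H) can satisfy the strict inequality in this case.
    if w_inf >= h_sup - offset:
        raise ValueError("constraint is unsatisfiable on these ranges")
    # W < H - offset forces W below sup(H) - offset ...
    new_w_sup = min(w_sup, h_sup - offset)
    # ... and H above inf(W) + offset.
    new_h_inf = max(h_inf, w_inf + offset)
    return (w_inf, new_w_sup), (new_h_inf, h_sup)


def avg_sensitivity(rng):
    """Sensitivity of avg over an attribute with range rng = (inf, sup)."""
    inf, sup = rng
    return abs(sup - inf) / 2


weight, height = (0, 150), (0, 200)
before = avg_sensitivity(weight)  # uses the initial constraint only
tight_w, tight_h = tighten_less_than_offset(weight, height, 100)
after = avg_sensitivity(tight_w)  # uses the constraint added by the selection
print(before, after)  # 75.0 50.0
```

The selection condition narrows Weight to [0, 100], which is what brings the bound for avg from 75 down to 50.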

8.3 Constraints generated by the functions 

We now define how to add new constraints for the newly created attributes computed by the functions. 

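Definition 20 (Constraints for functions). Let $schema(R) = (A, C)$, $A' \subseteq A$ and
$F = \{f_1(a_1), \ldots, f_n(a_n)\}$, where $a_1, \ldots, a_n \in A$. Then
$schema({}_{A'}\gamma_F(R)) = (A' \cup \{a_{f_1}, \ldots, a_{f_n}\},\ C \wedge c_{f_1} \wedge \ldots \wedge c_{f_n})$, where:

$$
c_{f_i} = \begin{cases}
\inf(C,a_i) \le a_{f_i} \le \sup(C,a_i) & \text{if } f_i = \max / \min / avg \\
0 \le a_{f_i} & \text{if } f_i = sum / count
\end{cases}
$$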
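As a rough illustration of Definition 20, the sketch below extends a schema with the attributes created by the aggregation functions. It makes simplifying assumptions that are not in the text: the constraint is represented only by a per-attribute (inf, sup) range, and the name given to each new attribute (`f"{f}_{a}"`) is a hypothetical convention.

```python
import math


def extend_schema(ranges, group_by, functions):
    """Ranges of the attributes produced by an aggregation with grouping.

    ranges:    dict mapping each attribute to its (inf, sup) range
    group_by:  the attributes that are kept (the set A' of the definition)
    functions: list of (f, attribute) pairs (the set F of the definition)
    """
    result = {a: ranges[a] for a in group_by}
    for f, a in functions:
        inf, sup = ranges[a]
        name = f"{f}_{a}"  # hypothetical name for the new attribute
        if f in ("max", "min", "avg"):
            # The result lies within the range of the aggregated attribute.
            result[name] = (inf, sup)
        elif f in ("sum", "count"):
            # Only the lower bound 0 is recorded, as in the definition.
            result[name] = (0, math.inf)
        else:
            raise ValueError(f"unsupported aggregation: {f}")
    return result


ranges = {"Weight": (0, 100), "Height": (100, 200)}
print(extend_schema(ranges, ["Height"], [("avg", "Weight"), ("count", "Weight")]))
```

A per-attribute range cannot express relations between attributes (such as Weight < Height - 100), so a faithful implementation would keep the full constraint and only add the new conjuncts to it.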

9 Global sensitivity 

We have concluded the analysis for all operators of relational algebra, and we now define the sensitivity 
of the whole query in a compositional way. 

For the computation of the sensitivity of a query, we need to take into account the constraint generated by that
query. We start by showing how to compute this constraint, in the obvious (compositional) way. Recall that we
have already defined the constraints generated by each relational algebra operator in Sections 6, 7 and 8.

Definition 21 (Constraint generated by an intermediate query). The global constraint generated by
an intermediate query $Q$ on relations with relational schema $r(a_1 : D_1, a_2 : D_2, \ldots, a_n : D_n)$ is defined
statically as:

$$C_Q = schema(Q(R))$$

where $R$ is any relation such that $schema(R) = (\{a_1, a_2, \ldots, a_n\}, C_I)$, with
$C_I = a_1 \in D_1 \wedge a_2 \in D_2 \wedge \ldots \wedge a_n \in D_n$.

We assume that the top-level operator in a query is an aggregation $\gamma_f$, applied to a query composed freely
using the other operators. We now show how to compute the sensitivity of the latter. Since the definition is
recursive, for the sake of elegance we assume an identity query $Id$.

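Definition 22 (Intermediate query sensitivity). Assume $op : (\mathcal{R}^n, d_{nH}) \to (\mathcal{R}, d_H)$ and $C_{op}$ the con-
straint obtained after the application of $op$:

$$
\begin{aligned}
S(Id) &= \min(1,\ diam(C_{Id})) && \text{base case} \\
S(op \circ Q) &= \min(\Delta_{op} \cdot S(Q),\ diam(C_{op \circ Q})) && \text{if } n = 1 \\
S(op \circ (Q_1, Q_2)) &= \min(\Delta_{op} \cdot \max(S(Q_1), S(Q_2)),\ diam(C_{op \circ (Q_1, Q_2)})) && \text{if } n = 2
\end{aligned}
$$

where $op$ can be any of $\cup, \cap, \setminus, \sigma, \pi, \times, \bowtie$ and the (classic) ${}_{A}\gamma_{F}$.

We are now ready to define the global sensitivity of the query: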

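Definition 23 (Global sensitivity). The global sensitivity $GS$ of a query $\gamma_f(Q)$ is defined as:

$$
GS(\gamma_f(Q)) = \begin{cases}
\Delta_f(C_Q) \cdot S(Q) & \text{if } f = count, sum, avg \\
\Delta_f(C_Q) & \text{if } f = \max, \min
\end{cases}
$$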
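The recursion for the intermediate sensitivity and the case split for the global sensitivity can be sketched as follows. This is a minimal sketch under stated assumptions: the `Query` class is hypothetical, and the operator sensitivities and constraint diameters are supplied by the caller rather than derived from the query, since their computation belongs to the earlier sections. The numbers in the usage lines are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """Node of an intermediate query; all numbers are supplied by the caller."""
    op: str
    delta_op: float = 1.0       # sensitivity of the operator (assumed given)
    diam: float = float("inf")  # diameter of the constraint after this node (assumed given)
    children: Tuple["Query", ...] = ()


def S(q: Query) -> float:
    """Sensitivity of an intermediate query, following the recursive definition."""
    if not q.children:  # identity query: base case
        return min(1.0, q.diam)
    # With one child this is S(Q); with two it is max(S(Q1), S(Q2)).
    inner = max(S(child) for child in q.children)
    return min(q.delta_op * inner, q.diam)


def global_sensitivity(f: str, delta_f: float, q: Query) -> float:
    """Global sensitivity of an aggregation f applied on top of q.

    delta_f is the sensitivity of f under the constraint generated by q.
    """
    if f in ("count", "sum", "avg"):
        return delta_f * S(q)
    if f in ("max", "min"):
        return delta_f
    raise ValueError(f"unsupported aggregation: {f}")


ident = Query("Id")
selection = Query("select", delta_op=1.0, children=(ident,))  # placeholder delta_op
print(global_sensitivity("avg", 50.0, selection))  # 50.0
```

Taking the minimum with the diameter at every node means that a tight constraint can cap the sensitivity even when the operators above it would otherwise amplify it.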

The following theorem (proof in the full version [3]) expresses the soundness and the strictness of the
bound computed with our method.

Theorem 1 (Soundness and strictness). The sensitivity bound computed by $GS(\cdot)$ is sound and strict.
Namely:

$$GS(\gamma_f(Q)) = \Delta_{\gamma_f(Q)}$$

10 Related Work 

The field of privacy in statistical databases has often been characterized by ad-hoc solutions or algorithms
to solve specific cases [7]. In recent years, however, there have been several efforts to develop a general
framework to define differentially private mechanisms. In [12] the authors proposed
a functional query language equipped with a type system that guarantees differential privacy. Their
approach is very elegant, and based on deep logical principles. However, it may be somewhat far from the
practices of the database community, which our paper aims to address.

The work closest to ours is the PINQ framework [11], where McSherry extends the LINQ lan-
guage with differential privacy functionalities developed by himself, Dwork and others in [10].
Despite this existing implementation, we felt the need for a more universal language to explore our ideas,
and the mathematically based framework of relational algebra seemed a natural choice. Furthermore, the
use of a constraint system to increase the precision of the sensitivity bound was, to our knowledge, never
explored before.

11 Conclusions and future work 

We showed how a classical language like relational algebra can be a suitable framework for differential
privacy and how technology already in place, like check constraints, can be exploited to improve the
precision of our sensitivity bounds.

Our analysis showed how the most common operation on databases, the join $\bowtie$, poses great privacy prob-
lems, and in the future we hope to develop solutions to this issue, possibly along the lines already presented
in Section 7.3.

In this paper we have considered only the sensitivity, that is, the effect of operators on distances.
Another interesting aspect would be to compute the effect on the $\epsilon$ exponent, as explored in [11], and
possibly to propose convenient strategies to query as much as possible over disjoint data sets.